Our data visualization examines the association between age and fitness, using running data from the Cherry Blossom Ten-mile Run held in Washington, DC, from 1973 to 2020. The table below describes the variables we kept for the analysis.
| Variable Names | Data Type | Variable Descriptions |
|---|---|---|
| Year | Integer | Year the race was held. |
| Name | Character | First and last name of runner. |
| Age | Integer | Age of runner at time of race. |
| Time | Time/Numeric | Time in hr:min:sec format to run 10 miles. |
| Division | Character | Groupings based on age and gender. |
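| pos_by_sex | Integer | Finishing position among runners of the same sex. |
| total_by_sex | Integer | Total number of runners of the same sex in that race. |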
| Sex | Character | Gender of runner. |
| PRCP | Numeric | Precipitation on race day. |
| TMAX | Integer | Maximum temperature on race day. |
| TMIN | Integer | Minimum temperature on race day. |
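Because Time is recorded in hr:min:sec format, it has to be converted to a number before it can be plotted against Age. The sketch below shows one possible conversion in base R; `time_to_seconds` is a hypothetical helper written for illustration, not a function from the original analysis, and the sample values are made up.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# convert "hh:mm:ss" strings to seconds; malformed or missing entries become NA.
time_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 3 || anyNA(p)) return(NA_real_)
    sum(as.numeric(p) * c(3600, 60, 1))
  }, numeric(1))
}

# Made-up sample values in the same format as the Time variable.
times <- c("00:43:20", "01:30:50", NA)
secs <- time_to_seconds(times)
mins <- secs / 60  # finishing time in minutes
```

Keeping missing times as `NA` rather than dropping them here leaves the decision about row removal to the cleaning step.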
The original data set has 347,402 rows and 17 columns. After cleaning, we ended up with 339,934 rows and 11 columns. We omitted 7,468 rows because they had missing values for the Time and/or Age variables. The table below lists the variables and data we excluded from our analysis and visualization:
| What was excluded | Reason for exclusion |
|---|---|
| Hometown | |
| Distance | |
| Date | |
| pos_by_div | |
| total_by_division | |
| Pace | |
| Year 1977? | |
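The cleaning described above (dropping the excluded columns, then dropping rows with a missing Age or Time) could look roughly like the following base R sketch. The data frame `raw` is a small made-up stand-in, not the real race data, and the object names `excluded`, `kept`, and `clean` are assumptions chosen for illustration.

```r
# Made-up stand-in for the raw data; the real data set has 17 columns.
raw <- data.frame(
  Year = c(2018L, 2018L, 2019L, 2019L),
  Name = c("A Runner", "B Runner", "C Runner", "D Runner"),
  Age = c(34L, NA, 41L, 29L),
  Time = c("01:25:10", "01:40:02", NA, "01:12:45"),
  Sex = c("F", "M", "F", "M"),
  Hometown = c("Washington DC", "Arlington VA", "Bethesda MD", "Reston VA"),
  Pace = c("8:31", "10:00", NA, "7:16"),
  stringsAsFactors = FALSE
)

# Columns excluded from the analysis (names taken from the exclusion table).
excluded <- c("Hometown", "Distance", "Date", "pos_by_div",
              "total_by_division", "Pace")

# Drop whichever excluded columns are present, then rows missing Age or Time.
kept <- raw[, !(names(raw) %in% excluded), drop = FALSE]
clean <- kept[!is.na(kept$Age) & !is.na(kept$Time), , drop = FALSE]

# Number of rows omitted by the missing-value filter.
n_omitted <- nrow(raw) - nrow(clean)
```

Comparing `dim(raw)` with `dim(clean)` after a step like this is a quick way to confirm the row and column counts reported in the text.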
Year, Age, Time, and Sex are the main variables to focus on.
Checklist for this section:
- Summary statistics: mean, median, mode, range, standard deviation, percentiles, distributions by the Sex variable, etc.
- Mention how many women and how many men ran in each year and overall.
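The checklist items above could be computed along these lines. This is a minimal base R sketch on a made-up toy data frame; `df_toy` and the numeric column `Time_sec` (finishing time in seconds) are assumptions for illustration and do not come from the real data.

```r
# Made-up toy data with the four focus variables; Time is stored in seconds.
df_toy <- data.frame(
  Year = c(2018, 2018, 2018, 2019, 2019, 2019),
  Sex = c("F", "M", "F", "M", "M", "F"),
  Age = c(34, 45, 29, 51, 38, 42),
  Time_sec = c(5110, 4800, 5600, 6100, 4500, 5300)
)

# Number of women and men in each year, and overall.
counts_by_year <- table(df_toy$Year, df_toy$Sex)
counts_overall <- table(df_toy$Sex)

# Mean, median, standard deviation, and range of finishing time by sex.
stats_by_sex <- aggregate(
  Time_sec ~ Sex, data = df_toy,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x),
                      min = min(x), max = max(x))
)

# Selected percentiles of age by sex (10th, 50th, 90th).
age_pct <- tapply(df_toy$Age, df_toy$Sex, quantile, probs = c(0.1, 0.5, 0.9))
```

Base R has no built-in function for the statistical mode, so that checklist item would need a small custom helper or a frequency table.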
summary.data.frame(df)
##       Year          Name                Age            Time
##  Min.   :1974   Length:339214      Min.   : 8.0   Min.   :00:43:20
##  1st Qu.:2001   Class :character   1st Qu.:29.0   1st Qu.:01:19:35
##  Median :2009   Mode  :character   Median :35.0   Median :01:30:50
##  Mean   :2006                      Mean   :36.6   Mean   :01:31:25
##  3rd Qu.:2015                      3rd Qu.:43.0   3rd Qu.:01:42:22
##  Max.   :2019                      Max.   :87.0   Max.   :02:20:00
##
##    Division           pos_by_sex     total_by_sex       Sex
##  Length:339214      Min.   :    1   Min.   :   27   Length:339214
##  Class :character   1st Qu.: 1109   1st Qu.: 3513   Class :character
##  Mode  :character   Median : 2445   Median : 6792   Mode  :character
##                     Mean   : 3134   Mean   : 6298
##                     3rd Qu.: 4739   3rd Qu.: 9030
##                     Max.   :11042   Max.   :11042
##                     NA's   :6       NA's   :6
##       PRCP             TMAX           TMIN
##  Min.   :0.0000   Min.   :44.0   Min.   :32.00
##  1st Qu.:0.0000   1st Qu.:56.0   1st Qu.:39.00
##  Median :0.0000   Median :64.0   Median :43.00
##  Mean   :0.0538   Mean   :63.3   Mean   :43.11
##  3rd Qu.:0.0500   3rd Qu.:70.0   3rd Qu.:47.00
##  Max.   :0.9300   Max.   :84.0   Max.   :58.00
##
These summaries helped me evaluate the data cleaning; we can fix or replace them later.